[57] ABSTRACT

An apparatus is coupled to a requesting unit and a memory. 
The apparatus includes a data path and a request control 
circuit. The data path is coupled to the requesting unit and 
the memory. The data path is for buffering a vector. The 
vector includes multiple data elements of a substantially 
similar data type. The request control circuit is coupled to 
the data path and the requesting unit. The request control 
circuit is for receiving a vector memory request from the 
requesting unit. The request control circuit services the 
vector memory request by causing the transference of the 
vector between the requesting unit and the memory via the 
data path. 

23 Claims, 6 Drawing Sheets 
LOAD AND STORE UNIT FOR A VECTOR PROCESSOR

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to computer systems and, more particularly, to providing load and store units for vector processors within computer systems.

2. Description of the Related Art

A vector processor or single instruction multiple data (SIMD) processor performs parallel calculations on multiple data elements that are grouped together to form data structures called "vectors." Such vector processors are well suited to multimedia processing, which often requires a large number of identical calculations. For example, for video coding, a pixel map representing a video image often contains thousands of pixel values, each of which must be processed in the same manner. A single vector can contain multiple pixel values that the vector processor processes in parallel in a single clock cycle. Accordingly, the parallel processing power of a vector processor can reduce processing time for such repetitive tasks by a factor equal to the number of data elements in a vector.

All data elements in a vector typically have the same data type, but vector processors may accommodate data elements of different sizes in different vectors. For example, all of the data elements in one vector may be 16-bit integers while the data elements in another vector are 32-bit floating point values. Conventionally, data elements and processing circuits in vector processors accommodate possible data widths that are multiples of 8 because conventional memories use addresses corresponding to 8 bits of storage. However, some processing may be more efficiently done using a data width that is not a multiple of 8. Video processing according to the MPEG standards, for example, includes a 9-bit data type. A vector processor that accommodates 16-bit data elements can process 9-bit values as 16-bit values, but that wastes much of the data width and processing power of the vector processor.

A vector processor can be adapted to process odd size data elements such as a 9-bit data type, if the internal data path and execution units of the processor have the proper data widths. However, if conventional memory is to be used, functional units that access memory, such as a load/store unit in the processor, may need to convert vectors having odd size data elements that do not have a simple match to 8-bit storage locations. A load/store unit that efficiently handles vectors with odd data element sizes is sought.

SUMMARY OF THE INVENTION

A load/store unit within a vector processor services memory requests to load or store vectors (multiple data elements of a substantially similar type). Such an apparatus provides the advantage of handling vectors instead of individual data elements. Further, such an apparatus provides the advantage of performing parallel calculations using multimedia data structures, thereby increasing the performance of processors required to perform such calculations.

In one embodiment of the invention, an apparatus is coupled to a requesting unit and a memory. The apparatus includes a data path and a request control circuit. The data path is coupled to the requesting unit and the memory. The data path is for buffering a vector. The vector includes multiple data elements of a substantially similar data type. The request control circuit is coupled to the data path and the requesting unit and receives a vector memory request from the requesting unit. The request control circuit services the vector memory request by causing the transference of the vector between the requesting unit and the memory via the data path.

In a further embodiment, the apparatus is a load/store unit for a vector processor. In yet a further embodiment, the apparatus is a vector load unit. In another embodiment, the apparatus is a vector store unit.

In another embodiment of the invention, a computer system includes a vector processor, a memory and a load/store unit. The load/store unit is coupled to the vector processor and the memory and includes a read/write data path and a request control circuit. The read/write data path is for buffering vectors. Each of the vectors includes multiple data elements of a substantially similar data type. The request control circuit is coupled to the read/write data path and the vector processor. The request control circuit is for receiving vector memory requests from the vector processor. The request control circuit services the vector memory requests by causing the transference of the vectors corresponding to the vector memory requests between the vector processor and the memory via the read/write data path.

In another embodiment, a method for loading and storing unaligned vector data with variable data types includes: receiving a first vector request for a first vector, the first vector request requesting access to a memory, the first vector including multiple data elements of a substantially similar data type; determining whether the first vector request is vector unaligned; generating a plurality of aligned vector requests corresponding to the first vector request if the first vector request is vector unaligned; providing the aligned vector requests to the memory; and servicing the aligned vector requests by the memory.

In another embodiment, a method for loading and storing unaligned vector data with variable data types includes: receiving a first vector request for a first vector, the first vector request requesting access to a memory, the first vector including multiple data elements of a substantially similar data type; determining whether the first vector request is vector unaligned; generating a plurality of aligned vector requests corresponding to the first vector request if the first vector request is vector unaligned; providing the aligned vector requests to the memory; and servicing the aligned vector requests by the memory.

In another embodiment, a method for loading and storing 9-bit byte data in an 8-bit byte memory system includes: receiving a 9-bit byte memory request from a requesting device; generating a plurality of 8-bit byte memory requests from the 9-bit byte memory request; providing the plurality of 8-bit byte memory requests to a memory system; storing a plurality of 8-bit bytes corresponding to the 8-bit byte memory requests in the 8-bit byte memory system if the 9-bit byte memory request is a memory write; receiving a plurality of 8-bit byte results corresponding to the plurality of 8-bit byte memory requests if the 9-bit byte memory request is a memory read; assembling the plurality of 8-bit byte results into a 9-bit byte result if the 9-bit byte memory request is a memory read; and providing the 9-bit byte result to the requesting device if the 9-bit byte memory request is a memory read.

In another embodiment, a method for out-of-order loading and storing of vectors through the use of transaction ID tags includes: a first step of receiving a first vector request from a first requesting device by a load/store unit; a second step of providing a first memory request corresponding to the
first vector request to a memory system by the load/store unit after the first step; a third step of providing a first transaction ID tag indicating the first memory request to the load/store unit by the memory system after the second step; a fourth step of storing the first transaction ID tag by the load/store unit after the third step; a fifth step of receiving a second vector request from a second requesting device by the load/store unit after the first step; a sixth step of providing a second memory request corresponding to the second vector request to the memory system by the load/store unit after the fifth step; a seventh step of providing a second transaction ID tag indicating the second memory request to the load/store unit by the memory system after the sixth step; an eighth step of storing the second transaction ID tag by the load/store unit after the seventh step; a ninth step of receiving by the load/store unit a first memory request result and the first transaction ID tag if the first memory request result is required by the first memory request, the ninth step being after the third step; a tenth step of receiving by the load/store unit a second memory request result and the second transaction ID tag if the second memory request result is required by the second memory request, the tenth step being after the eighth step; an eleventh step of providing the first request result and the first transaction ID tag to the first requesting device by the load/store unit if the first memory request result is required by the first vector request, the eleventh step being after the ninth step; and a twelfth step of providing the second request result and the second transaction ID tag to the second requesting device by the load/store unit if the second memory request result is required by the second vector request, the twelfth step being after the tenth step.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.

FIG. 1 is a block diagram of a multimedia signal processor in accordance with an embodiment of the invention.

FIG. 2 is a block diagram of a vector processor in accordance with an embodiment of the invention.

FIG. 3 is a diagram of a 288-bit data stream divided into slices according to an aspect of the invention.

FIG. 4 is a table showing various data types within a data slice for an aspect of the present invention.

FIG. 5 is a block diagram of a load/store unit in accordance with an embodiment of the invention.

FIG. 6 is a diagram of a micro-instruction pipe for a load/store unit in accordance with an embodiment of the invention.

FIG. 7 is a block diagram of read/write data paths for a load/store unit in accordance with an embodiment of the invention.

The use of the same reference symbols in different drawings indicates similar or identical items.

DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

The following sets forth a detailed description of the preferred embodiments. The description is intended to be illustrative of the invention and should not be taken to be limiting. Many variations, modifications, additions and improvements may fall within the scope of the invention as defined in the claims that follow.

FIG. 1 is a block diagram of a multimedia signal processor 100 such as described in U.S. patent application Ser. No. 08/699,597, attorney docket No. M-4355 US, filed on Aug. 19, 1996, entitled "Single-Instruction-Multiple-Data Processing in a Multimedia Signal Processor," and naming Le Trong Nguyen as inventor, and which is incorporated herein by reference in its entirety. Processor 100 includes a general purpose processor 110 coupled to a vector processor 120. General purpose processor 110 and vector processor 120 are coupled via bus 112. General purpose processor 110 and vector processor 120 are coupled to cache system 130 via bus 116 and bus 118, respectively. Cache system 130 contains a fast random access memory (RAM) block (shown graphically as blocks 132 and 134), read only memory (ROM) 136 and cache control logic 138. Cache system 130 can configure the RAM block into (i) an instruction cache 132A and data cache 134A for general purpose processor 110, and (ii) an instruction cache 132B and data cache 134B for vector processor 120. Cache system 130 is coupled to input/output bus (IOBUS) 180 and fast bus (FBUS) 190. IOBUS 180 is coupled to system timer 182, universal asynchronous receiver-transmitter (UART) 184, bitstream processor 186 and interrupt controller 188. FBUS 190 is coupled to device interface 192, direct memory access (DMA) controller 194, local bus interface 196 and memory controller 198.

General purpose processor 110 and vector processor 120 execute separate program threads in parallel. General purpose processor 110 typically executes instructions which manipulate scalar data. Vector processor 120 typically executes instructions having vector operands, i.e., operands each containing multiple data elements of the same type. In some embodiments, general purpose processor 110 has a limited vector processing capability and is generally suited for control operations. However, applications that require multiple computations on large arrays of data are not suited for scalar processing or even limited vector processing.

For example, multimedia applications such as audio and video data compression and decompression often require many repetitive calculations on pixel arrays and strings of audio data. To perform real-time multimedia operations, a general purpose processor which manipulates scalar data (e.g., one pixel value or sound amplitude per operand) must operate at a high clock frequency. In contrast, a vector processor executes instructions where each operand is a vector containing multiple data elements (e.g., multiple pixel values or sound amplitudes). Therefore, vector processor 120 can perform real-time multimedia operations at a fraction of the clock frequency required for general purpose processor 110 to perform the same function.

By allowing an efficient division of the tasks required by applications that perform multiple computations on large arrays of data (e.g., multimedia applications), the combination of general purpose processor 110 and vector processor 120 provides a high performance per cost ratio. Although in the preferred embodiment processor 100 is for multimedia applications, processor 100 may be used for other applications.

In one embodiment, general purpose processor 110 executes a real-time operating system designed for a media circuit board communicating with a host computer system. The real-time operating system communicates with a primary processor of the host computer system, services input/output (I/O) devices on or coupled to the media circuit board, and selects tasks which vector processor 120 executes. In that embodiment, vector processor 120 is designed to perform computationally intensive tasks requiring the manipulation of large data blocks, while general purpose processor 110 acts as the master processor to vector processor 120. Although general purpose processor 110 and
vector processor 120 are on a media circuit board in the preferred embodiment, such is not required by the invention.

In the exemplary embodiment, general purpose processor 110 is a 32-bit RISC processor which operates at 40 MHz and conforms to the standard ARM7 instruction set. The architecture for an ARM7 reduced instruction set computer (RISC) processor and the ARM7 instruction set are described in the ARM7DM Data Sheet available from Advanced RISC Machines Ltd. and which is incorporated herein by reference in its entirety. General purpose processor 110 also implements an extension of the ARM7 instruction set which includes instructions for an interface with vector processor 120. The extension to the ARM7 instruction set for the exemplary embodiment of the invention is described in copending U.S. patent application Ser. No. 08/699,295, attorney docket No. M-4366 US, filed on Aug. 19, 1996, entitled "System and Method for Handling Software Interrupts with Argument Passing," and naming Seungyoon Peter Song, Moataz A. Mohamed, Heon-Chul Park and Le Nguyen as inventors, and which is incorporated herein by reference in its entirety. General purpose processor 110 is coupled to vector processor 120 by bus 112 to carry out the extension of the ARM7 instruction set. Bus 112 includes an interrupt line for vector processor 120 to request an interrupt on general purpose processor 110.

In the exemplary embodiment, vector processor 120 is the digital signal processing engine of a multimedia signal processor. Vector processor 120 has a single-instruction-multiple-data (SIMD) architecture and manipulates both scalar and vector quantities. Vector processor 120 consists of a pipelined reduced instruction set computer (RISC) processing core 102 that operates at 80 MHz and has a 288-bit vector register file. Each vector register in the vector register file can contain up to 32 data elements. A vector register can hold thirty-two 8-bit or 9-bit ("byte9" format) integer data elements (i.e., 8-bit bytes or 9-bit bytes), sixteen 16-bit integer data elements, or eight 32-bit integer or floating point elements. Additionally, the exemplary embodiment can also operate on a 576-bit vector operand spanning two vector registers.

The instruction set for vector processor 120 includes instructions for manipulating vectors and for manipulating scalars. The instruction set of exemplary vector processor 120 and an architecture for implementing the instruction set are described in the copending U.S. patent application Ser. No. 08/699,597, attorney docket No. M-4355 US, filed on Aug. 19, 1996, entitled "Single-Instruction-Multiple-Data Processing in a Multimedia Signal Processor," and naming Le Trong Nguyen as inventor (incorporated by reference above).

General purpose processor 110 performs general tasks and executes a real-time operating system which controls communications with device drivers. Vector processor 120 performs vector tasks. General purpose processor 110 and vector processor 120 may be scalar or superscalar processors. The multiprocessor operation of the exemplary embodiment of the invention is more fully described in pending U.S. patent application Ser. No. 08/697,102, attorney docket No. M-4354 US, filed on Aug. 19, 1996, entitled "Multiprocessor Operation in a Multimedia Signal Processor," naming Le Trong Nguyen as inventor, which is incorporated herein by reference in its entirety.

Referring again to FIG. 1, in an embodiment of a computer system according to the invention, general purpose processor 110 and vector processor 120 share a variety of on-chip and off-chip resources which are accessible through a single address space via cache system 130. Cache system 130 couples to any of several memory mapped devices such as bitstream processor 186, UART 184, DMA controller 194, local bus interface 196 and a coder-decoder (CODEC) device interfaced through device interface 192.

In the preferred embodiment, cache system 130 uses a transaction-oriented protocol to implement a switchboard for data access among the load/store units of the processors and the memory mapped resources. The transaction-oriented protocol provides that if completion of an initial cache transaction is delayed (e.g., due to a cache miss), other cache access transactions may proceed prior to completion of the initial transaction. Cache system 130 and load/store unit 250 interact to implement a "step-aside-and-wait" capability for memory accesses. This capability is provided through the use of transaction IDs. When a load/store unit requests access to cache system 130, the cache system returns a transaction ID to the load/store unit. Later, when the cache system returns the results of the request (e.g., if the request was a cache read), the cache system provides the requesting load/store unit with the corresponding transaction ID to identify to which of the requests the results are responsive. The transaction-oriented interaction of cache system 130 and the load/store units of the processors, and cache system 130 in general, are described in copending U.S. patent application Ser. No. 08/751,149, filed on Nov. 15, 1996, issued on Jan. 12, 1999 as U.S. Pat. No. 5,860,158, entitled "Cache Control Unit with a Cache Request Transaction-Oriented Protocol," naming Yet-Ping Pai and Le Trong Nguyen as inventors, which is incorporated herein by reference in its entirety. A similar transaction-oriented protocol for a shared bus system is described in pending U.S. patent application Ser. No. 08/731,393, attorney docket No. M-4398 US, filed in October 1996, entitled "Shared Bus System with Transaction and Destination ID," naming Amjad Z. Qureshi and Le Trong Nguyen as inventors, which is incorporated herein by reference in its entirety.

Referring again to FIG. 1, cache system 130 couples general purpose processor 110 and vector processor 120 to two system busses: IOBUS 180 and FBUS 190. IOBUS 180 typically operates at a slower frequency than FBUS 190. Slower speed devices are coupled to IOBUS 180, while higher speed devices are coupled to FBUS 190. By separating the slower speed devices from the higher speed devices, the slower speed devices are prevented from unduly impacting the performance of the higher speed devices.

Cache system 130 also serves as a switchboard for communication between IOBUS 180, FBUS 190, general purpose processor 110, and vector processor 120. In most embodiments of cache system 130, multiple simultaneous accesses between the busses and processors are possible. For example, vector processor 120 is able to communicate with FBUS 190 at the same time that general purpose processor 110 is communicating with IOBUS 180. The combination of the switchboard and caching function is accomplished by using direct mapping techniques for FBUS 190 and IOBUS 180. Specifically, the devices on FBUS 190 and IOBUS 180 can be accessed by general purpose processor 110 and vector processor 120 by standard memory reads and writes at appropriate addresses.

FBUS 190 provides an interface to the main memory. The interface unit to the memory is composed of a four-entry address queue and a one-entry write-back latch. The interface can support one pending refill (read) request from general purpose processor instruction cache 132A, one pending refill (read) request from vector processor instruction cache 132B, one write request from vector processor data cache 134B, and one write-back request from vector processor data cache 134B due to a dirty cache line.

In the preferred embodiment, FBUS 190 is coupled to various high speed devices. For example, memory controller 198 provides an interface for a local memory if a local memory is provided for processor 100. DMA controller 194 controls direct memory accesses between the main memory of a host computer and the local memory of processor 100. Local bus interface 196 provides an interface to a local bus coupled to a host processor. DMA controller 194 and local bus interface 196 are well known in the art. Device interface 192 provides a hardware interface to various devices such as digital-to-analog and analog-to-digital converters (DACs and ADCs, respectively) for audio, video and communications applications.

In the preferred embodiment, IOBUS 180 operates at a frequency (40 MHz) lower than the operating frequency (80 MHz) of FBUS 190, and is coupled to various devices. For example, system timer 182 (e.g., a standard Intel 8254 compatible interval timer) interrupts general purpose processor 110 at scheduled intervals. UART 184 is a serial interface (e.g., a 16450 UART compatible integrated circuit) for use in modem or facsimile applications which require a standard serial communication ("COM") port. Bitstream processor 186 is a fixed hardware processor which performs specific functions (e.g., initial or final stages of MPEG coding or decoding) on an input or output bitstream. An exemplary embodiment of bitstream processor 186 is described in pending U.S. patent application Ser. No. 08/699,303, now U.S. Pat. No. 5,746,952, attorney docket No. M-4368 US, filed on Aug. 19, 1996, entitled "Methods and Apparatus for Processing Video Data," naming Cliff Reader, Jae Cheol Son, Amjad Qureshi and Le Nguyen as inventors, which is incorporated herein by reference in its entirety. Interrupt controller 188 controls interrupts of general purpose processor 110 and supports multiple interrupt priorities (e.g., according to the standard Intel 8259 interrupt controller).

The above referenced devices coupled to IOBUS 180 and FBUS 190 in the preferred embodiment are described in copending U.S. patent application Ser. No. 08/699,597 (incorporated by reference above), and in copending U.S. patent application Ser. No. 08/751,149, filed on Nov. 15, 1996, issued on Jan. 12, 1999 as U.S. Pat. No. 5,860,158, entitled "Cache Control Unit with a Cache Request Transaction-Oriented Protocol," naming Yet-Ping Pai and Le Trong Nguyen as inventors (also incorporated by reference above).

Referring to FIG. 2, vector processor 120 includes instruction fetch unit 210, instruction decoder 220, instruction issuer 230, execution unit 240 and load/store unit 250. Instruction fetch unit 210 is coupled to cache system 130 and instruction decoder 220. Instruction decoder 220 is coupled to instruction fetch unit 210 and instruction issuer 230. Instruction issuer 230 is coupled to instruction decoder 220, execution unit 240, and load/store unit 250. Execution unit 240 is coupled to instruction issuer 230 and load/store unit 250. Load/store unit 250 is coupled to instruction issuer 230, execution unit 240 and cache system 130.

Although load/store unit 250 is coupled to instruction issuer 230 and to execution unit 240 in the preferred embodiment, load/store unit 250 may be coupled to any type of vector processing unit or other requesting unit capable of requesting load or store operations. Also, although load/store unit 250 is coupled to cache system 130 in the preferred embodiment, load/store unit 250 may be coupled to any type of memory capable of servicing load and store requests for data stored or to be stored in the memory. Load/store unit 250 may be coupled to a memory hierarchy system (including, e.g., primary and secondary caches, main memory circuits and mass storage devices) or a component thereof. Although load/store unit 250 is described in the preferred embodiment for facilitating both loads and stores, the logic block boundaries are merely illustrative, and alternative embodiments may merge logic blocks or impose an alternative decomposition of functionality. For example, a store unit for facilitating only store or write instructions may be provided in accordance with the invention. Also, a load unit for facilitating only load or read instructions may be provided in accordance with the invention.

Instruction fetch unit 210 is responsible for prefetching instructions from instruction cache 132 and processing control flow instructions such as branch and jump-to-subroutine instructions. Instruction fetch unit 210 contains a 16-entry queue of prefetched instructions for the current execution stream and an 8-entry queue of prefetched instructions for the branch target stream. Instruction fetch unit 210 can receive 8 instructions from instruction cache 132 every cycle.

Instruction decoder 220 and instruction issuer 230 are responsible for decoding and scheduling instructions. Instruction decoder 220 can decode one instruction per cycle in the order of instruction arrival from instruction fetch unit 210 and break each instruction into a plurality of micro-instructions. Each micro-instruction is associated with a functional unit in execution data path 240 or load/store unit 250. Instruction issuer 230 can schedule micro-instructions for execution depending on execution resources and operand data availability. Instruction issuer 230 issues micro-instructions indicating load or store operations to load/store unit 250.

Vector processor 120 achieves much of its performance through various 288-bit data paths running at 12.5 ns/cycle in execution unit 240. The execution unit 240 data paths include the following: a four-ported register file that supports 2 reads and 2 writes per cycle; eight 32x32 parallel multipliers that can produce every 12.5 ns either eight 32-bit multiplications (in integer or floating point format), sixteen 16-bit multiplications or thirty-two 8-bit multiplications; and eight 32-bit arithmetic logic units (ALUs) that can produce every 12.5 ns either eight 32-bit ALU operations (in integer or floating point format), sixteen 16-bit ALU operations or thirty-two 8-bit or 9-bit operations. The execution unit 240 data paths are described in cofiled U.S. patent application Ser. No. 08/790,142, attorney docket No. M-4679 US, entitled "Execution Unit Data Paths for a Vector Processor," which is incorporated herein by reference in its entirety.

FIG. 3 shows the format of a 288-bit data stream 300, which is divided into eight 36-bit slices 310 according to one embodiment of the invention. Each slice 310 can accommodate multiple data types. In one embodiment, shown in FIG. 4, each slice 310 handles one 32-bit data word, two 16-bit data words, four 9-bit data words, or four 8-bit data words. The 36-bit data slice is described in copending U.S. patent application Ser. No. 08/749,619, attorney docket No. M-4601 US, entitled "Adder which Handles Multiple Data with Different Data Types," filed Nov. 18, 1996, which is incorporated herein by reference in its entirety.

Load/store unit 250 is designed to interface with data cache 134 through separate read and write data buses each
of which is 256 bits wide. Load/store unit 250 executes load 
and store micro-instructions by handling the interactions 
between execution unit 240 and cache system 130. 

Referring to FIG. 5, load/store unit 250 includes control 
logic 500 and read/write data path 550. Control logic 500 
includes instruction pipe 510, address data path 520, request 
control logic 530 and read/write control logic 540. Read/ 
write data path 550 includes read data path 560 and write 
data path 570. Instruction pipe 510 includes instruction 
queue 512, address queue 514 and grant queue 516. The 
input and output signals of load/store unit 250 are defined in 
Tables 1, 2 and 3. 
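
The width relationship between the 288-bit vector registers and the 256-bit cache data buses can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions, not the circuit of the embodiment: the helper names (pack, unpack, byte9_to_bus) are hypothetical, element 0 is assumed to occupy the least significant bits, and the conversion shown is the truncating byte9 store, in which the upper bit of each 9-bit element is dropped.

```python
ELEMENTS = 32   # data elements per vector register (from the text)
REG_BITS = 9    # width of a byte9 element in the register
MEM_BITS = 8    # width of an element on the 256-bit cache bus


def pack(elements, width):
    """Pack unsigned elements into one integer, element 0 lowest (assumed order)."""
    word = 0
    for i, value in enumerate(elements):
        if not 0 <= value < (1 << width):
            raise ValueError(f"element {i} does not fit in {width} bits")
        word |= value << (i * width)
    return word


def unpack(word, width, count=ELEMENTS):
    """Split an integer back into `count` elements of `width` bits each."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]


def byte9_to_bus(register_word):
    """Truncate each 9-bit element to 8 bits, giving a word for a 256-bit bus."""
    elements = unpack(register_word, REG_BITS)
    return pack([e & 0xFF for e in elements], MEM_BITS)


# Thirty-two byte9 elements, each with its ninth bit set.
reg = pack([0x100 | i for i in range(ELEMENTS)], REG_BITS)
bus = byte9_to_bus(reg)
```

Thirty-two 9-bit elements fill 32 x 9 = 288 bits, while the same thirty-two elements truncated to 8 bits fill 32 x 8 = 256 bits, which matches the bus width stated above.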

TABLE 1

Instruction Issue Interface.

Name                    Definition

lsu_read_stall          Asserted to indicate that, at the next cycle, the
                        load/store unit (LSU) cannot receive another load or
                        cache operation instruction. A load/store unit
                        instruction being issued in the current cycle is not
                        affected by this signal.

lsu_write_stall         Asserted to indicate that, at the next cycle, the LSU
                        cannot receive another store instruction. An LSU
                        instruction being issued in the current cycle is not
                        affected by this signal.

exe_lsu_valid           Asserted for one cycle to indicate that a load, store
                        or a cache operation instruction is being issued, in
                        this cycle, to the LSU. This signal is asserted in the
                        cycle the load/store unit instruction is in the read
                        state.

exe_lsu_op_code[17:0]   Valid only in the cycle exe_lsu_valid is asserted.
                        The format of the 18 bits is:
                        [17:13] opcode (same as in the instruction format)
                        [12]    when set, asserts cache off
                        [11:9]  data type (same as in the instruction format)
                        [8]     when set, indicates a vector destination register
                        [7]     when set, indicates the alternate bank
                        [6:2]   register number
                        [1]     when set, indicates that the store data is given
                                to load/store unit 250 in the next cycle.
                                Otherwise, it is given in the current cycle.

rfile_lsu_data[287:0]   See exe_lsu_op_code[1] for timing.

alu_lsu_addr[31:0]      Valid one cycle after exe_lsu_valid is asserted.
                        This is the effective address.



TABLE 2 

Data Cache Interface. 

Name                        Definition 

ccu_grant_id[9:0]           Indicates which unit may request to cache control unit 
                            for which data in current cycle. 
                            [9:6] Unit ID. 
                            [5:0] Transaction ID. 

lsu_req                     Asserted for one cycle to request data cache access in 
                            current cycle. 

lsu_rw                      Valid only when lsu_req is asserted. When asserted 
                            indicates a load request; when deasserted indicates a 
                            store request. 

lsu_ccu_off                 Valid only when lsu_req is asserted. Asserted to 
                            indicate a cache off access. For cache operation, this 
                            signal is deasserted. 

lsu_adr[31:0]               Data cache access address. Valid only when lsu_req is 
                            asserted. 

lsu_vec_type[1:0]           Valid only in the cycle lsu_req is being asserted. 
                            Indicates size of the access: 
                            0x scalar 
                            10 vector access that is within a 64-byte boundary 
                            11 vector access that crosses a 64-byte boundary 

lsu_data_type[2:0]          Valid only in the cycle lsu_req is being asserted. 
                            Indicates the data type of a store access (since load 
                            alignment is handled within the LSU). 
                            00x byte store to 8-bits 
                            01x byte 9 store to 8-bits by truncating upper 1 bit 
                            10x halfword store to 16-bits 
                            110 byte 9 store to 8-bits by zero-extending upper 7 bits 
                            111 word store to 32-bits 

ccu_lsu_rd_hold_2           This is a ph2 signal. Asserted to indicate that the 
                            read request being made in the current cycle is not 
                            being accepted. 

ccu_lsu_wr_hold_2           This is a ph2 signal. Asserted to indicate that the 
                            write request being made in the current cycle is not 
                            being accepted. 

ccu_lsu_hit_2               This is a ph2 signal. Asserted one cycle after lsu_req 
                            to indicate cache hit. 

ccu_lsu_wr_grant            Asserted to indicate that the store data can be sent to 
                            the CCU via ccu_din during the next cycle. 

ccu_din[143:0]              This is a double-pumped bus. The lower 144 of 288 bits 
                            are sent in ph1 and the upper 144 bits are sent in ph2. 
                            For a scalar register store, all 32 bits are sent via 
                            ccu_din[31:0]. 

ccu_data_id[9:0]            This bus contains signals used for returning a cache 
                            miss. The encodings are identical to ccu_grant_id. The 
                            identified unit is to match the transaction ID with the 
                            grant ID it has saved when the request was made. If the 
                            match occurs, the data is to be received on the 
                            ccu_dout bus during the next cycle. 

ccu_dout[127:0]             This is a double-pumped bus. The lower 128 of 256 bits 
                            are sent in ph1 and the upper 128 bits are sent in ph2. 
                            For a scalar register load, all 256 bits are sent so 
                            that the LSU can select the right bits (8, 9, 16, 32, 
                            256 or extend to 288). 



TABLE 3 

Result Bus Interface. 

Name                        Definition 

lsu_rfile_wt_req            Asserted for one cycle to indicate that a load data 
                            will be returned during the next cycle. 

lsu_rfile_wt_addr[6:0]      Valid only when lsu_rfile_wt_req is asserted. This is 
                            the destination register number. The encodings are: 
                            [6] set to indicate a vector register 
                            [5] set to indicate the alternate bank 
                            [4:0] indicates the register number 

lsu_rfile_wt_off_set[4:0]   Valid only when lsu_rfile_wt_req is asserted. 
                            Indicates the first byte from which the returning load 
                            data is to be written. This is to handle partial 
                            register file write for unaligned load. 

lsu_rfile_wt_num_byte[4:0]  Valid only when lsu_rfile_wt_req is asserted. 
                            Indicates the number of bytes (starting from the 
                            lsu_rfile_wt_off_set) to write for the returning load 
                            data. This is to handle partial register file write 
                            for unaligned load. 

lsu_rfile_dout[287:0]       Valid only during one cycle after lsu_rfile_wt_req is 
                            asserted. Load result bus to the write stage. 



Referring to FIG. 6, instruction queue 512 and address 
queue 514 each have a depth of 4 instructions and addresses, 
respectively. Instruction pipe 510 receives micro- 
instructions and addresses from instruction issuer 230 and 
execution unit 240, and makes requests to cache system 130. 
The micro-instructions are received via exe_lsu_op_code 
[16:0]. The micro-instructions are stored by instruction pipe 
510 in the 4-entry deep instruction queue 512. Address 
queue 514 (not shown in FIG. 6) stores addresses corre- 
sponding to the instructions stored in instruction queue 512. 
In the preferred embodiment, instruction pipe 510 uses four 
control lines to determine the position of the next micro- 
instruction in instruction queue 512. The four control lines 
are inst0_load, inst1_load, inst2_load and inst3_load. If 
instruction queue 512 is full, load/store unit 250 sends a stall 
signal (e.g., lsu_read_stall) to instruction issuer 230 to 
prevent issue of an instruction requiring a micro-instruction 
that load/store unit 250 cannot currently accept. Each entry 
of instruction queue 512 contains 17 bits to store a micro- 
instruction, a bit for storing a micro-instruction valid flag, 
and a bit for storing a micro-instruction clear flag. A request 
ID is stored in grant queue 516 along with a request ID valid 
bit. The top request ID in the grant queue controls multi- 
plexer 630 which selects the next instruction for presentation 
to cache system 130. 

Address data path 520 selects a cache address from either 
address queue 514 or a new incoming address calculated in 
execution unit 240. The new address comes from execution 
unit 240 simultaneously with a corresponding micro- 
instruction issued by instruction issuer 230. 

Request control logic 530 requests access to cache system 
130 according to a selected micro-instruction in the instruc- 
tion queue of instruction pipe 510 and monitors interface 
signals from cache control unit 138 which controls access to 
cache system 130. Request control logic 530 also controls 
how cache access is requested depending on the format of 
the load/store data. 

If the load and store request address provided by address 
path 520 to cache control unit 138 is vector-unaligned, i.e., 
the address is not a multiple of 32 bytes, the request is 
broken down to two separate aligned requests. The first 
request uses the original starting address with the lowest 5 
bits as zeros. The second request uses the first address plus 
32. The actual data requested by the micro-instruction 
straddles these two requests. Read/write data path 550 
assembles the requested 288-bit vector from the 256-bit 
quantities returned by cache system 130. 
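
The address arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (clear the lowest 5 address bits, then add 32); the names VECTOR_BYTES and split_request are hypothetical and do not appear in the specification, and the sketch covers only the unaligned case for a 32-byte vector, not the 9-bit data type discussed next. 

```python
VECTOR_BYTES = 32  # size of one aligned cache request in bytes (256 bits)


def split_request(addr: int) -> list[int]:
    """Return the aligned request address(es) covering a 32-byte vector at addr.

    Hypothetical helper: models only the address rule stated in the text.
    """
    first = addr & ~(VECTOR_BYTES - 1)  # original address with lowest 5 bits zeroed
    if first == addr:
        return [first]                  # vector-aligned: a single request suffices
    return [first, first + VECTOR_BYTES]  # second request is the first address plus 32
```

For instance, under these assumptions a request at address 0x1007 would produce aligned requests at 0x1000 and 0x1020, and the requested data would straddle the two returned 256-bit quantities. 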

When loading an aligned vector having 9-bit data 
elements, the cache load request is also broken down to 2 
separate cache requests. The first request uses the original 
starting address. (For aligned vectors the 5 lowest address 
bits are zeros.) The second request uses the first address plus 
32. 9-bit data elements are stored in memory as 16-bit values 
with the 7 most significant bits being zeros. Accordingly, 
one 288-bit vector having 9-bit data elements occupies 512 
bits (or 64 bytes) of memory. Read/write data path 550 
removes the upper 7 bits from each set of 16-bits in two 
256-bit values returned. 
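
As a rough software model of this unpacking step, the following Python sketch takes the two 256-bit (32-byte) values and keeps the low 9 bits of each 16-bit value. The function name unpack_byte9 is hypothetical, and the little-endian byte order within each 16-bit value is an assumption made for illustration; the text does not state the byte order. 

```python
def unpack_byte9(line_lo: bytes, line_hi: bytes) -> list[int]:
    """Recover thirty-two 9-bit elements from two 32-byte cache values.

    Hypothetical model; byte order within each 16-bit value is assumed
    to be little-endian.
    """
    raw = line_lo + line_hi
    if len(raw) != 64:
        raise ValueError("expected two 32-byte values (512 bits in total)")
    elements = []
    for i in range(0, 64, 2):
        halfword = int.from_bytes(raw[i:i + 2], "little")
        elements.append(halfword & 0x1FF)  # drop the upper 7 bits of each 16-bit value
    return elements
```

The result holds 32 elements of 9 bits each, i.e. the 288 bits of one vector, taken from 512 bits of memory. 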

When storing a vector having 9-bit data elements, only 
one store request is sent to cache control unit 138. Cache 
control unit 138 extends each 9-bit byte into two 8-bit bytes 
(or 16 bits) by filling the extra 7 bits with zeros. Two 256-bit 
values are written to cache system 130. 
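
The store-side extension can be modeled in the same spirit. The Python sketch below zero-extends each 9-bit element to 16 bits and returns the two 256-bit values that would be written. The name pack_byte9 is hypothetical and the little-endian byte order is again an assumption, since the text specifies only that the extra 7 bits are filled with zeros. 

```python
def pack_byte9(elements: list[int]) -> tuple[bytes, bytes]:
    """Zero-extend thirty-two 9-bit elements to 16 bits each.

    Hypothetical model of the extension performed on a byte9 store;
    returns two 32-byte (256-bit) values. Byte order is assumed.
    """
    if len(elements) != 32:
        raise ValueError("a byte9 vector has thirty-two elements")
    raw = b"".join((e & 0x1FF).to_bytes(2, "little") for e in elements)
    return raw[:32], raw[32:]  # two 256-bit values for the cache
```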

Read/write control logic 540 controls queues in read/write 
data path 550 and generates control signals for data align- 
ment in read/write data path 550. Specifically, read/write 
control logic 540 controls a read buffer queue in read data 
path 560 and a write buffer queue in write data path 570. 

Read data path 560 receives data from cache system 130 
when a load request is granted and a read transaction ID is 
issued to load/store unit 250 by cache control unit 138, when 
there is a data cache hit corresponding to the load request, 
and when the request data ID in the returned data matches 
with the corresponding read transaction ID previously 
granted to load/store unit 250 by cache control unit 138. 
Read data path 560 also performs data alignment of 
unaligned load data according to the vector data type being 
used. For example, if the request address is unaligned, the 
requested data vector will straddle two 256-bit vectors from 
data cache 134 and a byte aligner in read/write data path 550 
selects the requested data bytes and aligns the data bytes in 
a 288-bit format for execution unit 240. The resulting data 
is sent to the register file through a read buffer. The read 
buffer has a depth of two entries. 

Write data path 570 receives data from the register file and 
transfers the data to cache system 130 when a load/store unit 
store request is granted by cache control unit 138. In this 
block the write buffer has a depth of four entries. Cache 
system 130 contains circuitry which converts a 288-bit 
vector into a format for storage in cache memory. Signal 
lsu_data_type[2:0] indicates a data type and therefore the 
type of conversion required. 

In operation, load/store unit 250 performs load and store 
instructions for vector processor 120. Load/store unit 250 
handles up to 4 pending micro-instructions from instruction 
issuer 230 and corresponding effective addresses calculated 
by execution unit 240. Vectors are provided to load/store 
unit 250 from the register file (e.g., execution unit 240). 
Load/store unit 250 receives a micro-instruction from 
instruction issuer 230 when an instruction being executed 
requires a memory access. When instruction issuer 230 
asserts an instruction valid signal (drives exe_lsu_valid 
high), the address being issued from execution unit 240 to 
load/store unit 250 (exe_lsu_op_code[16:0]) is valid and 
load/store unit 250 receives the micro-instruction into the 
instruction queue of instruction pipe 510. Otherwise, load/ 
store unit 250 ignores all incoming micro-instructions as 
being invalid. A micro-instruction that is present at the input 
of load/store unit 250 is invalid, e.g., if it has already been 
processed. The exe_lsu_valid signal indicates to load/store 
unit 250 when to read its opcode input. 

If the instruction queue is full, instruction pipe 510 asserts 
a read-stall signal (lsu_read_stall high). In this case, 
instruction issuer 230 will not issue new micro-instructions 
until the read-stall signal is deasserted (lsu_read_stall low). 

When the instruction queue contains at least one micro- 
instruction and grant signal ccu_grant permits a request, 
request control logic 530 requests cache access by asserting 
request signal lsu_req and driving signal lsu_adr according to 
a value from address data path 520. Signal lsu_wr is 
asserted (or not) to identify the request as a load (or a store). 
If cache control unit 138 deasserts the read or write hold 
signal, cache control unit 138 accepts the request. If the 
cache control hold signal is asserted (high), request control 
logic 530 must retry the request and assert signal lsu_req. 

For a store request, when request control logic 530 
receives a write grant signal, request control logic 530 
asserts signal lsu_request, and if cache control unit 138 
deasserts signal ccu_lsu_wr_hold, request control logic 
530 asserts signal lsu_adr according to the value from 
address data path 520. Read/write control logic 540 directs 
write data path 570 to send 288-bits of store data to cache 
system 130 at the next CLK1 and CLK2 phases (144 
bits/phase) after cache control unit 138 asserts signal ccu_ 
lsu_wr_grant. Instruction pipe 510 then removes the current 
store request micro-instruction from the instruction queue to 
make room for a new incoming micro-instruction from 
execution unit 240. 

For a load, request control logic 530 monitors the read 
hold signal ccu_lsu_rd_hold_2 from cache control unit 
138. Cache control unit 138 deasserts the read hold signal to 
grant read access. Instruction pipe 510 updates the instruc- 
tion queue and may process another load or store request, if 
available, while waiting for the load data. 

When request control logic 530 detects the deasserted 
read hold signal from cache control unit 138, instruction 
pipe 510 saves a transaction ID in a grant ID queue. When 
cache control unit 138 asserts cache hit signal ccu_lsu_ 
hit_2 following the cycle of the read grant signal, read data 
path 560 receives read data from data cache 134, and 
instruction pipe 510 removes the current request from both 
the instruction queue and the grant ID queue of instruction 
pipe 510. 

If the cache hit signal is not asserted (ccu_lsu_hit_2 is 
low) following the cycle of the read grant signal, request 
control logic 530 monitors the incoming data ID lines from 
cache control unit 138. When the incoming data ID matches 
a transaction ID previously saved in the grant ID queue of 
instruction pipe 510, read data path 560 receives the read 
data from data cache 134 during the cycle following the 
cycle in which the match was indicated. Instruction pipe 510 
then removes the current request from both the instruction 
queue and the grant ID queue. 

When the grant ID queue is 75 percent full (i.e., three entries 
are used and only one unused entry is left), load/store unit 
250 stops making requests to cache control unit 138 and 
waits until the grant ID queue reduces to 25 percent full (i.e., 
one entry is used and three entries are available) so that 
it won't miss a transaction ID from cache control unit 138. 
Transaction IDs could be pipelined, so load/store unit 250 
stops making requests to cache control unit 138 when the 
grant ID queue is 75 percent full. 

If there is a cache hit or a match between a transaction ID 
and data ID, the requested micro-instruction is selected from 
the instruction queue and saved to the relevant micro- 
instruction register (i.e., 17 bit read opcode register rd_op_ 
code[16:0] for a load or 17 bit write opcode register 
wr_op_code[16:0] for a store). The micro-instruction in the 
opcode register is used to control read or write data paths 
560, 570, to interface signals to the register file of execution 
unit 240. 

Referring to FIG. 7, an exemplary embodiment of read/ 
write data path 550 is shown. Read/write data path 550 
includes read data path 560 and write data path 570. Read 
data path 560 includes byte aligner 710 and read data buffer 
700. Byte aligner 710 is coupled to cache system 130 and 
read data buffer 700. Read data buffer 700 is coupled to byte 
aligner 710 and execution unit 240. Write data path 570 
includes write data buffer 750, multiplexer 760, multiplexer 
770 and multiplexer 780. Write data buffer 750 is coupled to 
execution unit 240 and multiplexers 760 and 770. Multi- 
plexers 760 and 770 are coupled to write data buffer 750 and 
multiplexer 780. Multiplexer 780 is coupled to multiplexers 
760 and 770 and cache system 130. Write data buffer 750 
includes two data buffers, data buffer 752 and data buffer 
754, which are coupled to multiplexer 760 and multiplexer 
770, respectively. 

Write data buffer 750 receives 288-bits over a 288-bit bus 
from execution unit 240. Data buffer 752 receives 144 bits, 
and data buffer 754 receives 144 bits. Multiplexer 760 of 
write data path 570 is used to select data from data buffer 
752. Multiplexer 770 is used to select data from data buffer 
754. Multiplexer 780 is used to alternately select data buffer 
752 or data buffer 754. The bus connecting multiplexer 780 
with cache system 130 is a 256-bit logical bus. The bus is a 
double-pumped bus. Multiplexer 780 provides 128 bits in a 
first phase and 128 bits in a second phase. 

During an unaligned load (i.e., a load from an address that 
is not a multiple of 256), read data path 560 performs the 
necessary alignment so that unaligned data may be extracted 
from an aligned 256-bit value. Additionally, byte aligner 710 
aligns data bytes from a 256-bit value in a 288-bit vector as 
illustrated in FIGS. 3 and 4. 

For example, in a 9-bit byte load, load/store unit 250 
receives a 9-bit byte load request from execution unit 240. 
Since the memory system is based on 8-bit bytes, loading a 
9-bit byte (byte9) requires two 8-bit bytes or 16 bits. 
Therefore, request control logic 530 generates two 8-bit byte 
load requests to cache control unit 138. The first request uses 
the original starting address with the lowest 5 bits as zeros, 
and the second request address is the first request address 
plus 32. When the requested information is returned from 
cache system 130, the upper 7 bits of the upper 8-bit byte are 
removed by byte aligner 710. The lower 1-bit of the upper 
8-bit byte is then combined with the lower 8-bit byte and 
provided to read buffer 700 for transfer to execution unit 
240. With this process, loading a vector with thirty two 9-bit 
bytes therefore requires sixty four 8-bit bytes from data 
cache 134. 

In each unaligned 8-bit, 16-bit, or 32-bit vector load, 
load/store unit 250 receives a 32-byte load request for 32 
bytes stored at an address that is not a multiple of 32 from 
execution unit 240. Since the memory system loads by 
32-byte lines, loading 32 bytes that are unaligned requires 
two 32-byte loads. Therefore, request control logic 530 
generates two 32-byte load requests to cache control unit 
138. The first request uses the address of the first line to be 
requested with the lowest 5 bits as zeros. The second request 
address is the first request address plus 32. When the 
requested information is returned from cache system 130, 
the unneeded portions (each of which is an 8-bit, 16-bit or 
32-bit segment depending on the type of unaligned vector 
load) of the first request are removed by byte aligner 710. The 
needed portion is then combined with the needed portion of 
the second request and provided to read buffer 700 for 
transfer to execution unit 240. With this process, loading a 
vector with thirty two 8-bit bytes therefore requires sixty 
four 8-bit bytes from data cache 134. 
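
The selection performed by the byte aligner for an unaligned load can be illustrated with a small Python model. It is a sketch only: memory is modeled as a flat byte string, the names LINE_BYTES and unaligned_load are hypothetical, and the conversion to the 288-bit register format is not modeled. 

```python
LINE_BYTES = 32  # one aligned 256-bit line


def unaligned_load(memory: bytes, addr: int) -> bytes:
    """Assemble 32 bytes starting at addr from one or two aligned lines.

    Hypothetical model: 'memory' stands in for the data cache.
    """
    first = addr & ~(LINE_BYTES - 1)      # first request: lowest 5 bits zeroed
    offset = addr - first                 # number of unneeded leading bytes
    line0 = memory[first:first + LINE_BYTES]
    if offset == 0:
        return line0                      # aligned: one line is enough
    second = first + LINE_BYTES           # second request: first address plus 32
    line1 = memory[second:second + LINE_BYTES]
    # Keep the needed tail of the first line and the needed head of the second.
    return line0[offset:] + line1[:offset]
```

In this model, 64 bytes are fetched to deliver 32 requested bytes whenever the address is not a multiple of 32, matching the count given above. 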

When storing a byte9 vector (with a length of thirty-two 
9-bit bytes), load/store unit 250 receives a 9-bit byte vector 
store request from execution unit 240. Since the memory 
system is based on 8-bit bytes, storing a 9-bit byte (byte9) 
requires two 8-bit bytes or 16 bits. However, only one store 
request is sent to cache control unit 138. Cache control unit 
138 extends the 9-bit byte into two 8-bit bytes (or 16 bits) 
and fills the extra 7 bits of the upper 8-bit byte with zeros. 
An advantage of doing store alignment in cache system 130 
is the availability of data elements in aligned cache lines that 
are preserved by unaligned stores that affect only some of 
the data elements in the cache line. 

While the invention has been described with reference to 
various embodiments, it will be understood that these 
embodiments are illustrative and that the scope of the 
invention is not limited to them. Many variations, 
modifications, additions, and improvements of the embodi- 
ments described are possible in accordance with the inven- 
tion as claimed. Those skilled in the art will recognize that 
alternative embodiments may be implemented in agreement 
with the present invention. For example, those skilled in the 
art will recognize that boundaries between logic blocks are 
merely illustrative. That is, alternative embodiments may 
merge logic blocks or impose an alternative decomposition 
of functionality. For example, the various logic blocks of 
load/store unit 250 may be structured in different ways. 
Specifically, a load unit and a store unit may be provided in 
accordance with the invention. Also, separate or combined 
data paths may be provided in accordance with the inven- 
tion. Moreover, alternative embodiments may combine mul- 
tiple instances of a particular component, or may employ 
alternative polarities of signals. For example, control signals 
may be asserted low instead of high, as in the preferred 
embodiment. Other implementations of instruction, address 
and data pipelines, and read/write control logic and the 
busses connecting them may be used to create a load/store 
unit 250 according to the present invention. Other imple- 
mentations of memory or cache systems, bus interfaces and 
busses, and processors may be used to create a computer 
system according to the present invention. These and other 
variations, modifications, additions, and improvements may 
fall within the scope of the invention as defined in the claims 
which follow. 
What is claimed is: 

1. A load/store unit coupled to a requesting unit and a 
memory, the load/store unit comprising: 

a read/write data path coupled to the requesting unit and 
the memory, the data path for buffering a vector, the 
vector including a plurality of data elements of a 
substantially similar data type; and 

a request control circuit coupled to the data path and the 
requesting unit, the request control circuit for receiving 
a vector memory request from the requesting unit, the 
vector memory request for requesting transference of 
the vector between the requesting device and the 
memory, the request control circuit servicing the vector 
memory request by causing the transference of the 
vector from the requesting unit to the memory via the 
read/write data path if the vector memory request is a 
store request, and by causing the transference of the 
vector from the memory to the requesting unit via the 
read/write data path if the vector memory request is a 
load request, wherein the request control circuit com- 
prises: 

an instruction pipe for receiving at least one vector 
memory request from the requesting unit, the 
instruction pipe including an instruction queue for 
buffering the at least one vector memory request, the 
instruction queue including a plurality of entries for 
storing the at least one vector memory request; 

an address data path coupled to the instruction pipe and 
the requesting unit, the address data path for select- 
ing a memory address from one of the instruction 
pipe and the requesting unit; 

a request control block coupled to the address data path, 
the request control block generating at least one 
memory access request for each of the at least one 
vector memory request, the request control block 
controlling the manner of memory access according 
the data type of the vector; and 

a read/write control block coupled to the request con- 
trol block, the read/write control block for control- 
ling the read/write data path. 

2. The apparatus of claim 1, wherein the instruction pipe 
further comprises a grant queue for storing a plurality of 
transaction IDs, the instruction queue including a plurality 
of entries for storing the plurality of transaction IDs, each of 
the plurality of transaction IDs identifying each of the at 
least one memory access request. 

3. The apparatus of claim 1, wherein: 
the request control block 

generates a plurality of aligned vector requests corre- 
sponding to each of the at least one vector memory 
request that is vector unaligned, and 

generates a plurality of 8-bit byte memory requests 
corresponding to each of the at least one vector 
memory request if the at least one vector memory 
request is for a 9-bit byte vector data type; and 
the read/write data path 

assembles a plurality of results provided by the 
memory in response to each of the at least one vector 
memory request, the plurality of results being 
assembled into a first vector result, if the at least one 
vector memory request is for an unaligned vector 
read, and 

assembles a plurality of 8-bit byte results provided by 
the memory in response to the at least one vector 
memory request, the plurality of 8-bit byte results 
being assembled into a 9-bit byte result, if the at least 
one vector memory request is for a 9-bit byte 
memory read. 

4. The apparatus of claim 1, wherein the requesting unit 
comprises a functional data execution unit. 

5. The apparatus of claim 1, wherein the requesting unit 
comprises an instruction issuer circuit. 

6. The apparatus of claim 1, wherein the memory com- 
prises a cache system. 

7. A load/store unit coupled to a requesting unit and a 
memory, the load/store unit comprising: 

a read/write data path coupled to the requesting unit and 
the memory, the read/write data path for buffering a 
vector, the vector including a plurality of data elements 
of a substantially similar data type; and 
a request control circuit coupled to the read/write data 
path and the requesting unit, the request control circuit 
for receiving a vector memory request from the 
requesting unit, the vector memory request for request- 
ing transference of the vector between the requesting 
device and the memory, the request control circuit 
servicing the vector memory request by causing the 
transference of the vector from the requesting unit to 
the memory via the read/write data path if the vector 
memory request is a store request, and by causing the 
transference of the vector from the memory to the 
requesting unit via the read/write data path if the vector 
memory request is a load request, wherein: 
the request control circuit is coupled to the memory, the 
request control circuit for providing memory access 
requests to the memory; 
the memory includes a memory control circuit for 
receiving the memory access requests from the 
request control circuit, for providing a unique trans- 
action ID for each memory access request to the 
request control circuit; and 
the request control circuit includes a grant queue for 
storing each transaction ID, the request control cir- 
cuit receiving a result for each memory access 
request and a corresponding transaction ID identify- 
ing to which request the result corresponds if the 
vector memory request is a vector load request. 

8. The apparatus of claim 7, wherein the requesting unit 
is a functional data execution unit. 

9. The apparatus of claim 7, wherein the requesting unit 
is an instruction issuer circuit. 

10. The apparatus of claim 7, wherein the memory is a 
cache system. 

11. An apparatus coupled to a requesting unit and a 
memory, the apparatus comprising: 

a data path coupled to the requesting unit and the memory, 
the data path for buffering a vector, the vector including 
a plurality of data elements of a substantially similar 
data type; and 

a request control circuit coupled to the data path and the 
requesting unit, the request control circuit for receiving 
a vector memory request from the requesting unit, the 
vector memory request for requesting transference of 
the vector between the requesting device and the 
memory, the request control circuit servicing the vector 
memory request by causing the transference of the 
vector between the requesting unit and the memory via 
the data path; wherein: 

the data path includes a read data path for buffering the 
vector to be transferred from the memory to the 
requesting unit; 

the request control circuit is coupled to the read data 
path and the requesting unit, the request control 
circuit for receiving a vector memory read request 
from the requesting unit, the request control circuit 
servicing the vector memory read request by causing 
the read data path to receive the vector from the 
memory and to provide the vector to the requesting 
unit; 

the request control circuit is coupled to the memory, the 
request control circuit for providing memory access 
requests to the memory; 
the memory includes a memory control circuit for 
receiving the memory access requests from the 
request control circuit and for providing a unique 
transaction ID for each memory access request to the 
request control circuit; and 
the request control circuit includes a grant queue for 
storing each transaction ID, the request control cir- 
cuit receiving a result for each memory access 
request and a corresponding transaction ID identify- 
ing to which request the result corresponds if the 
vector request is a vector load request. 

12. A computer system comprising: 
a vector processor; 
a memory; and 

a load unit, the load unit coupled to the vector processor 
and the memory, the load unit comprising: 
a read data path for buffering a vector, the vector 
including a plurality of data elements of a substan- 
tially similar data type, the read data path providing 
the vector to the requesting unit responsive to receiv- 
ing the vector from the memory; and 
a request control circuit coupled to the read data path 
and the vector processor, the request control circuit 
for receiving a vector memory request from the 
vector processor, the request control circuit servicing 
the vector memory request by causing a transference 
of vectors corresponding to the vector memory 
request between the vector processor and the 
memory via the read data path; 
wherein 

the memory has a data width; 
the read data path includes: 
a byte aligner, the byte aligner receiving the vector 
from the memory, the vector having a data type, 
the data type having a data type width, the byte 
aligner assembling the vector if the vector data 
type width is greater than the memory data width; 
and 

a read data buffer for buffering the vector, the read 
data buffer receiving the vector from the byte 
aligner and providing the vector to the vector 
processor. 

13. A computer system comprising: 
a vector processor; 
a memory; and 

a load/store unit, the load/store unit coupled to the vector 
processor and the memory, the load/store unit compris- 
ing: 

a read/write data path for buffering a vector, the vector 
including a plurality of data elements of a substan- 
tially similar data type; and 
a request control circuit coupled to the read/write data 
path and the vector processor, the request control 
circuit for receiving a vector memory request from 
the vector processor, the request control circuit ser- 
vicing the vector memory request by causing a 
transference of vectors corresponding to the vector 
memory request between the vector processor and 
the memory via the read/write data path; 
wherein 

the load/store unit is packaged within the vector pro- 
cessor; 

the vector processor includes a requesting unit; and 
the request control circuit comprises: 

an instruction pipe for receiving a plurality of 
requests from the requesting unit, the instruction 
pipe including an instruction queue for buffering 
the plurality of requests, the instruction queue 
including a plurality of entries for storing the 
plurality of requests; 

an address data path coupled to the instruction pipe 
and the requesting unit, the address data path for 
selecting a memory address from one of the 
instruction pipe and the requesting unit; 

a request control block coupled to the address data 
path, the request control block requesting access 
to the memory for each of the plurality of requests 
from the requesting unit, the request control block 
controlling the manner of memory access accord- 
ing the data type of the vector; and 

a read/write control block coupled to the request 
control block, the read/write control block for 
controlling the read/write data path. 

14. A load/store unit comprising: 
a vector processor; 

a memory; and 

a load/store unit, the load/store unit coupled to the vector 
processor and the memory, the load/store unit compris- 
ing: 

a read/write data path for buffering a vector, the vector 
including a plurality of data elements of a substan- 
tially similar data type; and 
a request control circuit coupled to the read/write data 
path and the vector processor, the request control 
circuit for receiving a vector memory request from 
the vector processor, the request control circuit ser- 
vicing the vector memory request by causing a 
transference of vectors corresponding to the vector 
memory request between the vector processor and 
the memory via the read/write data path; wherein 
the request control circuit is coupled to the memory, 
the request control circuit for providing memory 
access requests to the memory; 
the memory includes a memory control circuit for 
receiving the memory access requests from the 
request control circuit and for providing a unique 
transaction ID for each memory access request to 
the request control circuit; and 
the request control circuit includes a grant queue for 
storing each transaction ID, the request control 
circuit receiving a result for each memory access 
request and a corresponding transaction ID to 
identify to the request control circuit to which 
request the result corresponds if the vector request 
is a vector load request. 
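
For illustration only, the transaction-ID bookkeeping recited in claim 14 might be modeled as below. This is a minimal sketch, not the claimed circuit: the class names `MemoryControl` and `RequestControl`, the string request labels, and the use of a dictionary to stand in for the grant queue are all assumptions made for the example.

```python
class MemoryControl:
    """Hypothetical memory control circuit: issues a unique ID per access."""

    def __init__(self, contents):
        self.contents = contents      # address -> stored vector (toy model)
        self.next_id = 0
        self.in_flight = {}           # transaction ID -> address

    def accept(self, address):
        # Provide a unique transaction ID for each memory access request.
        tid = self.next_id
        self.next_id += 1
        self.in_flight[tid] = address
        return tid

    def complete(self, tid):
        # Return the result together with its transaction ID.
        address = self.in_flight.pop(tid)
        return tid, self.contents[address]


class RequestControl:
    """Hypothetical request control circuit holding a grant queue."""

    def __init__(self, memory):
        self.memory = memory
        self.grant_queue = {}         # transaction ID -> request label

    def issue_load(self, label, address):
        tid = self.memory.accept(address)
        self.grant_queue[tid] = label  # store the ID until the result arrives
        return tid

    def receive(self, tid, result):
        # The ID identifies which request the result corresponds to.
        label = self.grant_queue.pop(tid)
        return label, result
```

In this sketch, results may be completed in any order; the stored ID alone is enough to match each result to its load request.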

15. A method for loading and storing unaligned vector 
data with variable data types, the method comprising: 

receiving a first vector request for a first vector, the first 
vector request requesting access to a memory, the first 
vector including a plurality of data elements of a 
substantially similar data type; 
determining whether the first vector request is vector 
unaligned; 

generating a plurality of aligned vector requests corre- 
sponding to the first vector request if the first vector 
request is vector unaligned; 
providing the aligned vector requests to the memory; and 
servicing the aligned vector requests by the memory; 
wherein 

the first vector includes a first portion and a second 
portion; 

the step of generating a plurality of aligned vector 
requests includes: 

providing a first aligned vector request correspond- 
ing to a first vector-width portion of the memory, 
the first vector-width portion of the memory 
including the first portion of the first vector; 

providing a second aligned vector request corre- 
sponding to a second vector-width portion of the 
memory, the second vector-width portion of the 
memory including the second portion of the first 
vector; 

the first vector request is a vector store request; and 
the step of servicing the aligned vector requests com- 
prises storing the first vector, the first portion of the 
first vector stored in a first aligned vector, the second 
portion of the first vector stored in a second aligned 
vector. 
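
The unaligned store of claim 15 can be illustrated with a short software model. The sketch below assumes a 32-byte vector width (`VECTOR_BYTES`) and a byte-addressed `bytearray` memory; neither value nor the function names come from the claim, and the code is a behavioral illustration rather than the claimed hardware.

```python
VECTOR_BYTES = 32  # assumed vector width; the claim does not fix a value


def aligned_requests(address, width=VECTOR_BYTES):
    """Return the aligned base addresses touched by a vector at `address`."""
    base = address - (address % width)
    if base == address:
        return [base]                 # already aligned: one request
    return [base, base + width]       # unaligned: two aligned requests


def store_vector(memory, address, vector, width=VECTOR_BYTES):
    """Store `vector` at a possibly unaligned `address` in `memory`."""
    if len(vector) != width:
        raise ValueError("vector must be exactly one vector width long")
    offset = address % width
    requests = aligned_requests(address, width)
    if offset == 0:
        memory[address:address + width] = vector
        return requests
    first, second = requests
    split = width - offset
    # First portion of the vector lands in the tail of the first aligned vector.
    memory[first + offset:first + width] = vector[:split]
    # Second portion lands in the head of the second aligned vector.
    memory[second:second + offset] = vector[split:]
    return requests
```

For example, with a 32-byte width a store to address 40 produces aligned requests at 32 and 64, and bytes outside the range 40 to 71 are left unchanged.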

16. A method for loading and storing unaligned vector 
data with variable data types, the method comprising: 

receiving a first vector request for a first vector, the first 
vector request requesting access to a memory, the first 
vector including a plurality of data elements of a 
substantially similar data type; 
determining whether the first vector request is vector 
unaligned; 

generating a plurality of aligned vector requests corre- 
sponding to the first vector request if the first vector 
request is vector unaligned; 
providing the aligned vector requests to the memory; and 
servicing the aligned vector requests by the memory; 
wherein 

the first vector includes a first portion and a second 
portion; 

the step of generating a plurality of aligned vector 
requests includes: 

providing a first aligned vector request correspond- 
ing to a first vector-width portion of the memory, 
the first vector-width portion of the memory 
including the first portion of the first vector; 

providing a second aligned vector request corre- 
sponding to a second vector-width portion of the 
memory, the second vector-width portion of the 
memory including the second portion of the first 
vector; 

the first vector request is a vector load request; 
the step of servicing the aligned vector requests com- 
prises: 

receiving a plurality of results corresponding to the 
aligned vector requests from the memory; 

assembling the plurality of results into the first vector 
if the vector load request is for unaligned vector 
data; and 

providing the first vector to the requesting unit; and 
the step of receiving the plurality of results includes: 
receiving a first vector-width result including a first 
portion and a second portion, the second portion 
of the first vector-width result including the first 
portion of the first vector; and 

receiving a second vector-width result including a 
first portion and a second portion, the first portion 
of the second vector-width result including the 
second portion of the first vector; and 

the step of assembling the plurality of results into the 
first vector includes combining the second portion 
of the first vector-width result with the first portion 
of the second vector-width result to provide the 
first vector. 
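
As an illustration of the assembling step of claim 16, the sketch below reads two aligned vector-width results and combines the second portion of the first result with the first portion of the second. The 32-byte width (`VECTOR_BYTES`), the byte-addressed memory, and the function name `load_vector` are assumptions for the example only.

```python
VECTOR_BYTES = 32  # assumed vector width; the claim does not fix a value


def load_vector(memory, address, width=VECTOR_BYTES):
    """Load one vector from a possibly unaligned `address` in `memory`."""
    offset = address % width
    base = address - offset
    # First aligned vector-width result.
    first_result = bytes(memory[base:base + width])
    if offset == 0:
        return first_result           # aligned: no assembly needed
    # Second aligned vector-width result.
    second_result = bytes(memory[base + width:base + 2 * width])
    # Second portion of the first result + first portion of the second result.
    return first_result[offset:] + second_result[:offset]
```

With memory holding the byte values 0 through 95, a load from address 40 returns the values 40 through 71, drawn from the aligned vectors at 32 and 64.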

17. A method for loading and storing 9-bit byte data in an 
8-bit byte memory system, the method comprising: 

receiving a 9-bit byte memory request from a requesting 
device; 

generating a plurality of 8-bit byte memory requests 
corresponding to the 9-bit byte memory request; 
providing the plurality of 8-bit byte memory requests to a 
memory system; 
storing a plurality of 8-bit bytes corresponding to the 8-bit 
byte memory requests in the 8-bit byte memory system 
if the 9-bit byte memory request is a memory write; 
receiving a plurality of 8-bit byte results corresponding to 
the plurality of 8-bit byte memory requests if the 9-bit 
byte memory request is a memory read; 
assembling the plurality of 8-bit byte results into a 9-bit 
byte result if the 9-bit byte memory request is a 
memory read; and 
providing the 9-bit byte result to the requesting device if 
the 9-bit byte memory request is a memory read. 

18. The method of claim 17, further comprising: 
receiving the provided plurality of 8-bit byte memory 
requests by the memory; 
providing a unique 8-bit byte transaction ID for each of 
the plurality of 8-bit byte memory requests received by 
the memory; 

storing each of the provided 8-bit byte transaction IDs in 
a grant queue; and 

receiving a plurality of corresponding 8-bit byte transac- 
tion IDs with the plurality of 8-bit byte results if the 
9-bit byte memory request is a memory read, the 
transaction IDs identifying which of the plurality of 
8-bit byte memory results corresponds to each 8-bit 
byte request. 

19. The method of claim 17, wherein 

the 9-bit byte includes an upper 1 bit and a lower 8 bits, 
and 

the step of generating a plurality of 8-bit byte memory 
requests from the 9-bit byte memory request includes 
providing a first 8-bit byte memory request correspond- 
ing to an 8-bit memory byte including the upper 1 bit 
of the 9-bit byte, and 
providing a second 8-bit byte memory request corre- 
sponding to an 8-bit memory byte including the 
lower 8 bits of the 9-bit byte. 

20. The method of claim 19, wherein: 

the step of receiving the plurality of 8-bit byte results 
includes: 
receiving a first 8-bit byte result, the first 8-bit byte 
result including an upper 7 bits and a lower 1 bit, the 
lower 1 bit being the upper 1 bit of the 9-bit byte, and 

receiving a second 8-bit byte result, the second 8-bit 
byte result including the lower 8 bits of the 9-bit 
byte; and 

the step of assembling the plurality of 8-bit byte results 
into a 9-bit byte result includes combining the lower 1 
bit of the first 8-bit byte with the second 8-bit byte 
result to provide the 9-bit byte result. 
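
The bit layout recited in claims 19 and 20 can be made concrete with a small model. In the sketch below, the function names `split_9bit` and `assemble_9bit` are hypothetical, and writing zeros into the unused upper 7 bits of the first memory byte is an assumption; the claims only require that the lower 1 bit of that byte carry the upper 1 bit of the 9-bit byte.

```python
def split_9bit(value):
    """Split a 9-bit byte into two 8-bit memory bytes (first, second)."""
    if not 0 <= value < 512:
        raise ValueError("value must fit in 9 bits")
    first = (value >> 8) & 0x01   # upper 1 bit, placed in the lower 1 bit
    second = value & 0xFF         # lower 8 bits
    return first, second


def assemble_9bit(first, second):
    """Combine two 8-bit byte results into one 9-bit byte result."""
    # Only the lower 1 bit of the first result is used; its upper 7 bits
    # are ignored, as in the assembling step of claim 20.
    return ((first & 0x01) << 8) | (second & 0xFF)
```

For example, the 9-bit value 0x1A5 splits into the memory bytes 0x01 and 0xA5, and assembling any first byte whose lowest bit is 1 with 0xA5 yields 0x1A5 again.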

21. A method for out-of-order loading and storing of 
vectors through the use of transaction ID tags, comprising: 

a first step of receiving a first vector request from a first 
requesting device by a load/store unit; 

a second step of providing a first memory request corre- 
sponding to the first vector request to a memory system 
by the load/store unit after the first step; 

a third step of providing a first transaction ID tag indi- 
cating the first memory request to the load/store unit by 
the memory system after the second step; 

a fourth step of storing the first transaction ID tag by the 
load/store unit after the third step; 

a fifth step of receiving a second vector request from a 
second requesting device by the load/store unit after the 
first step; 

a sixth step of providing a second memory request cor- 
responding to the second vector request to the memory 
system by the load/store unit after the fifth step; 

a seventh step of providing a second transaction ID tag 
indicating the second memory request to the load/store 
unit by the memory system after the sixth step; 

an eighth step of storing the second transaction ID tag by 
the load/store unit after the seventh step; 

a ninth step of receiving by the load/store unit a first 
memory request result and the first transaction ID tag if 
the first memory request result is required by the first 
memory request, the ninth step being after the third 
step; 

a tenth step of receiving by the load/store unit a second 
memory request result and the second transaction ID 
tag if the second memory request result is required by 
the second memory request, the tenth step being after 
the eighth step; 

an eleventh step of providing the first request result and 
the first transaction ID tag to the first requesting device 
by the load/store unit if the first memory request result 
is required by the first vector request, the eleventh step 
being after the ninth step; and 

a twelfth step of providing the second request result and 
the second transaction ID tag to the second requesting 
device by the load/store unit if the second memory 
request result is required by the second vector request, 
the twelfth step being after the tenth step. 

22. The method of claim 21, wherein the memory system 
is a cache system. 

23. The method of claim 22, wherein the first requesting 
unit and the second requesting unit are the same unit. 
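
The ordering permitted by claim 21 can be illustrated with a toy model in which results return in a different order than the requests were made. In this sketch the class `LoadStoreUnit`, its method names, and the device labels are hypothetical, and the internal tag counter merely stands in for the memory system that provides the transaction ID tags in the claim.

```python
class LoadStoreUnit:
    """Toy model of tag-based, out-of-order result delivery."""

    def __init__(self):
        self.next_tag = 0      # stands in for the memory system's tag source
        self.pending = {}      # stored tags: tag -> (device, needs_result)
        self.delivered = []    # (device, tag, result) returned to requesters

    def submit(self, device, needs_result=True):
        # Receive a vector request, pass it to memory, and store the tag.
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = (device, needs_result)
        return tag

    def memory_returns(self, tag, result=None):
        # A result arrives with its tag; forward both to the requester
        # only if the original request required a result (e.g. a load).
        device, needs_result = self.pending.pop(tag)
        if needs_result:
            self.delivered.append((device, tag, result))
```

Because each result carries its tag, the second request may complete before the first without either requester receiving the wrong data.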

* * * * * 